16-820: Advanced Computer Vision - HW6 - 2024

Introduction

In this homework we will work with SAM (Paper), a state-of-the-art foundation model for segmentation. Foundation models are large deep neural networks trained on massive datasets. SAM and other promptable segmentation models take an input prompt, such as a box around an object, and generate a mask of just that object. You will learn how to run off-the-shelf deep learning models and use them in your own work. We will combine 2D segmentation masks with camera geometry and depth to arrive at dense 3D point clouds. The steps you will implement are:

  • Run SAM on a single image from the dataset.
  • Project the mask to 3D points in world coordinates.
  • For all the unseen views, do:
    1. Project all points in world coordinates to the image frame.
    2. Automatically generate a new input prompt for SAM and run SAM.
    3. Project the new mask to world coordinates and append to the existing coordinates.
  • Filter the point cloud using an off-the-shelf filtering approach.

Homework introduction video here. To submit this homework to Gradescope, please submit both your code and its output. E.g., for Q4, show the function you implemented and the visualizations created in the for loop. For Q1.1, give the K matrix you computed.

Instructor: Matthew O'Toole
OUT: November 21st, 2024
DUE: December 6th, 2024
TAs: Nikhil Keetha, Ayush Jain, Yuyao Shi

Definitions

A couple of definitions that will hopefully avoid confusion:

These are the existing frames:

  • Camera frame/OpenCV Camera Frame: This is the reference frame for 3D points with respect to the camera. This is the camera frame discussed in class.
  • Blender Camera Frame: This is the camera frame for 3D points used in Blender; the y and z axes point in the opposite direction w.r.t. the OpenCV camera frame.
  • World Frame: This is the frame for 3D points with respect to the world origin. This frame differs by a rigid body transformation from any camera frame.
  • Image Frame: The 2D points in the image, e.g., (u, v) with u in [0, W) along the horizontal axis and v in [0, H) along the vertical axis.

When we refer to a prompt, we mean the box around the object to be segmented, which is the input to SAM.
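For illustration, such a box prompt can be derived automatically from an existing boolean mask. Below is a minimal numpy sketch (`mask_to_box` is a hypothetical helper, not part of the homework skeleton); note that `np.where` returns (row, column) indices, i.e., (y, x), so x comes from the columns:

```python
import numpy as np

def mask_to_box(mask):
    # mask: (H, W) bool array; returns [x0, y0, x1, y1] with (x0, y0) the
    # top-left and (x1, y1) the bottom-right corner of the tight box.
    rows, cols = np.where(mask)              # np.where gives (y, x), not (x, y)
    return np.array([cols.min(), rows.min(), cols.max(), rows.max()])

mask = np.zeros((8, 8), dtype=bool)
mask[2:5, 3:7] = True                        # object spans rows 2-4, cols 3-6
print(mask_to_box(mask))                     # -> [3 2 6 4]
```

The resulting [x0, y0, x1, y1] layout is the same one used by the show_box visualization helper defined later in the notebook.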

How to run this homework

We will use deep neural networks, which require a CUDA-enabled graphics card with at least 12GB of VRAM. The easiest way to get access to one is Google Colab; press the button below to open this homework in Google Colab. You'll find a useful tutorial on how to use Google Colab here.

How to submit this homework

  1. First press 'Run all' in the notebook and make sure that all plots come from your code, not the default plots shipped with the notebook.
  2. Then export as PDF; for the written version of the homework, simply submit this PDF and make sure to select all your results and code for each question.
  3. For the code, submit your iPython notebook file (.ipynb). We use this to check that your homework runs and gives the correct output. If we discover your code cannot reproduce the answer submitted in the written part, you will receive zero points for the question.
  4. No requirement on filenames.

FAQs

  1. Hint: For Q3.1, remember the difference between the image frame (x horizontal) and the coordinates you might get from the mask. E.g., when you retrieve coordinates from the mask, they will not immediately align with the coordinates expected by the intrinsic matrix.

  2. You should not have to modify the viz_pts_3d function.

  3. You should only call filter_points if thresh is not None, e.g. if thresh is not None: (call filter_points)

  4. Use another Google account, or use Kaggle if you can't connect to Google Colab.

  5. The function mask2cam should contain an if statement to check if thresh is None. If thresh is None, do not run filter_points().

  6. You should not have to change any of the given plotting functions; if you do, there is probably an error in your code.

  7. In Q4, you should not have to change any code in the main for loop. Only the functions we have separated, i.e., cam2img, keep_dist, filter_for_box, prompt_points_to_box.

  8. filter_for_box should not take K as input; the input to the function is already in world coordinates.

  9. You should not add a thresh argument to mask2cam in the for loop in Q4; this is by design.

  10. Filtering in filter_for_box should happen in the world frame.

  11. If you are getting unexpected results in Q4, you might be doing the correction for the Blender frame wrong. Remember, we have OpenCV Frame <-> Blender Frame <-> World Frame. The 'transforms' given are Blender Frame <-> World Frame; make corrections accordingly.

  12. Another common source of error in Q4 is not getting the correct inverse of the transform. Hint: Google 'inverse of rigid body transforms'.

  13. To get a PDF for this homework, please follow these steps:

    • Download the Python notebook
    • Open it in Jupyter Notebook
    • Download as HTML
    • Save as PDF
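FAQs 11 and 12 often come up together, so here is a minimal numpy sketch of both conversions. The diag(1, -1, -1, 1) flip is the usual Blender-to-OpenCV convention (the y and z camera axes are reversed), but treat it as an assumption and verify it against the dataset's transforms; the function names below are placeholders, not required helpers.

```python
import numpy as np

# FAQ 11: Blender's camera y and z axes point opposite to OpenCV's, so a
# cam2world transform in the Blender convention can be converted by flipping
# those two camera axes (an assumed convention; verify against your dataset):
FLIP_YZ = np.diag([1.0, -1.0, -1.0, 1.0])

def blender_c2w_to_opencv(c2w_blender):
    return c2w_blender @ FLIP_YZ   # right-multiply: changes the camera frame only

# FAQ 12: for a rigid transform T = [[R, t], [0, 1]] the inverse has the
# closed form [[R.T, -R.T @ t], [0, 1]] -- no general matrix inversion needed.
def invert_rigid(T):
    R, t = T[:3, :3], T[:3, 3]
    T_inv = np.eye(4)
    T_inv[:3, :3] = R.T
    T_inv[:3, 3] = -R.T @ t
    return T_inv

# sanity checks: the flip is an involution, and T @ invert_rigid(T) == I
c, s = np.cos(0.3), np.sin(0.3)
T = np.array([[c, -s, 0, 1.0],
              [s,  c, 0, 2.0],
              [0,  0, 1, 3.0],
              [0,  0, 0, 1.0]])
print(np.allclose(blender_c2w_to_opencv(blender_c2w_to_opencv(T)), T))  # -> True
print(np.allclose(T @ invert_rigid(T), np.eye(4)))                      # -> True
```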
Open In Colab

Environment Set-up with Google Colab

If running from Google Colab, set using_colab=True below and run the cell. In Colab, be sure to select 'GPU' under 'Edit'->'Notebook Settings'->'Hardware accelerator'.

In [1]:
using_colab = True
In [2]:
if using_colab:
    # install everything
    import torch
    import torchvision
    print("PyTorch version:", torch.__version__)
    print("Torchvision version:", torchvision.__version__)
    print("CUDA is available:", torch.cuda.is_available())
    import sys
    !{sys.executable} -m pip install opencv-python matplotlib  # note: 'os' is part of the standard library and is not pip-installable
    !pip install open3d
    !{sys.executable} -m pip install 'git+https://github.com/facebookresearch/segment-anything.git'

    !mkdir ckpts
    !wget -P ckpts https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
    !pip install gdown
    !gdown 1K375xNjWAwZ7kmhjTuJccC3y6Ik5tN4q # download the dataset from Google Drive
    !unzip images.zip
PyTorch version: 2.5.1+cu121
Torchvision version: 0.20.1+cu121
CUDA is available: True
Requirement already satisfied: opencv-python in /usr/local/lib/python3.10/dist-packages (4.10.0.84)
Requirement already satisfied: matplotlib in /usr/local/lib/python3.10/dist-packages (3.8.0)
Collecting open3d
  Downloading open3d-0.18.0-cp310-cp310-manylinux_2_27_x86_64.whl.metadata (4.2 kB)
Successfully installed addict-2.4.0 comm-0.2.2 configargparse-1.7 dash-2.18.2 dash-core-components-2.0.0 dash-html-components-2.0.0 dash-table-5.0.0 ipywidgets-8.1.5 jedi-0.19.2 open3d-0.18.0 pyquaternion-0.9.9 retrying-1.3.4 werkzeug-3.0.6 widgetsnbextension-4.0.13
Collecting git+https://github.com/facebookresearch/segment-anything.git
  Cloning https://github.com/facebookresearch/segment-anything.git to /tmp/pip-req-build-d0dgdqga
Successfully built segment_anything
Successfully installed segment_anything-1.0
--2024-12-01 04:16:55--  https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
Resolving dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)... 13.227.219.59, 13.227.219.70, 13.227.219.10, ...
Connecting to dl.fbaipublicfiles.com (dl.fbaipublicfiles.com)|13.227.219.59|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2564550879 (2.4G) [binary/octet-stream]
Saving to: 'ckpts/sam_vit_h_4b8939.pth'

sam_vit_h_4b8939.pt 100%[===================>]   2.39G   156MB/s    in 14s     

2024-12-01 04:17:10 (170 MB/s) - 'ckpts/sam_vit_h_4b8939.pth' saved [2564550879/2564550879]

Requirement already satisfied: gdown in /usr/local/lib/python3.10/dist-packages (5.2.0)
Downloading...
From (original): https://drive.google.com/uc?id=1K375xNjWAwZ7kmhjTuJccC3y6Ik5tN4q
From (redirected): https://drive.google.com/uc?id=1K375xNjWAwZ7kmhjTuJccC3y6Ik5tN4q&confirm=t&uuid=4186b7f4-3ef1-451e-a2af-83a45e9bdbd2
To: /content/images.zip
100% 118M/118M [00:02<00:00, 57.9MB/s]
Archive:  images.zip
   creating: images/
  inflating: images/data_demo.gif    
   creating: images/dataset/
  inflating: images/dataset/transforms_train.json  
   creating: images/dataset/train/
  inflating: images/dataset/train/r_58_0.png  
  inflating: images/dataset/train/r_20_0.png  
  inflating: images/dataset/train/r_30_0.png  
  inflating: images/dataset/train/r_47_0.png  
  inflating: images/dataset/train/r_22_0.png  
  inflating: images/dataset/train/r_48_0.png  
  inflating: images/dataset/train/r_4_0.png  
  inflating: images/dataset/train/r_40_0.png  
  inflating: images/dataset/train/r_2_0.png  
  inflating: images/dataset/train/r_3_0.png  
  inflating: images/dataset/train/r_68_0.png  
  inflating: images/dataset/train/r_33_0.png  
  inflating: images/dataset/train/r_15_0.png  
  inflating: images/dataset/train/r_71_0.png  
  inflating: images/dataset/train/r_66_0.png  
  inflating: images/dataset/train/r_99_0.png  
  inflating: images/dataset/train/r_72_0.png  
  inflating: images/dataset/train/r_13_0.png  
  inflating: images/dataset/train/r_54_0.png  
  inflating: images/dataset/train/r_96_0.png  
  inflating: images/dataset/train/r_81_0.png  
  inflating: images/dataset/train/r_78_0.png  
  inflating: images/dataset/train/r_41_0.png  
  inflating: images/dataset/train/r_52_0.png  
  inflating: images/dataset/train/r_29_0.png  
  inflating: images/dataset/train/r_56_0.png  
  inflating: images/dataset/train/r_16_0.png  
  inflating: images/dataset/train/depth.npy  
  inflating: images/dataset/train/r_43_0.png  
  inflating: images/dataset/train/r_93_0.png  
  inflating: images/dataset/train/r_53_0.png  
  inflating: images/dataset/train/r_90_0.png  
  inflating: images/dataset/train/r_89_0.png  
  inflating: images/dataset/train/r_17_0.png  
  inflating: images/dataset/train/r_77_0.png  
  inflating: images/dataset/train/r_75_0.png  
  inflating: images/dataset/train/r_39_0.png  
  inflating: images/dataset/train/r_80_0.png  
  inflating: images/dataset/train/r_79_0.png  
  inflating: images/dataset/train/r_36_0.png  
  inflating: images/dataset/train/r_92_0.png  
  inflating: images/dataset/train/r_67_0.png  
  inflating: images/dataset/train/r_83_0.png  
  inflating: images/dataset/train/r_70_0.png  
  inflating: images/dataset/train/r_10_0.png  
  inflating: images/dataset/train/r_88_0.png  
  inflating: images/dataset/train/r_50_0.png  
  inflating: images/dataset/train/r_1_0.png  
  inflating: images/dataset/train/r_12_0.png  
  inflating: images/dataset/train/r_37_0.png  
  inflating: images/dataset/train/r_94_0.png  
  inflating: images/dataset/train/r_23_0.png  
  inflating: images/dataset/train/r_18_0.png  
  inflating: images/dataset/train/r_55_0.png  
  inflating: images/dataset/train/r_25_0.png  
  inflating: images/dataset/train/r_21_0.png  
  inflating: images/dataset/train/r_34_0.png  
  inflating: images/dataset/train/r_19_0.png  
  inflating: images/dataset/train/r_31_0.png  
  inflating: images/dataset/train/r_74_0.png  
  inflating: images/dataset/train/r_98_0.png  
  inflating: images/dataset/train/r_7_0.png  
  inflating: images/dataset/train/r_9_0.png  
  inflating: images/dataset/train/r_14_0.png  
  inflating: images/dataset/train/r_38_0.png  
  inflating: images/dataset/train/r_87_0.png  
  inflating: images/dataset/train/r_26_0.png  
  inflating: images/dataset/train/r_24_0.png  
  inflating: images/dataset/train/r_62_0.png  
  inflating: images/dataset/train/r_85_0.png  
  inflating: images/dataset/train/r_5_0.png  
  inflating: images/dataset/train/r_28_0.png  
  inflating: images/dataset/train/r_42_0.png  
  inflating: images/dataset/train/r_86_0.png  
  inflating: images/dataset/train/r_46_0.png  
  inflating: images/dataset/train/r_49_0.png  
  inflating: images/dataset/train/r_61_0.png  
  inflating: images/dataset/train/r_82_0.png  
  inflating: images/dataset/train/r_44_0.png  
  inflating: images/dataset/train/r_91_0.png  
  inflating: images/dataset/train/r_65_0.png  
  inflating: images/dataset/train/r_32_0.png  
  inflating: images/dataset/train/r_84_0.png  
  inflating: images/dataset/train/r_73_0.png  
  inflating: images/dataset/train/r_35_0.png  
  inflating: images/dataset/train/r_76_0.png  
  inflating: images/dataset/train/r_59_0.png  
  inflating: images/dataset/train/r_11_0.png  
  inflating: images/dataset/train/r_0_0.png  
  inflating: images/dataset/train/r_95_0.png  
  inflating: images/dataset/train/r_27_0.png  
  inflating: images/dataset/train/r_8_0.png  
  inflating: images/dataset/train/r_6_0.png  
  inflating: images/dataset/train/r_60_0.png  
  inflating: images/dataset/train/r_45_0.png  
  inflating: images/dataset/train/r_57_0.png  
  inflating: images/dataset/train/r_51_0.png  
  inflating: images/dataset/train/r_63_0.png  
  inflating: images/dataset/train/r_69_0.png  
  inflating: images/dataset/train/r_64_0.png  
  inflating: images/dataset/train/r_97_0.png  
  inflating: images/expected_output.png  
  inflating: images/cam_frames.png   
In [3]:
from IPython.display import Image
img_size = 400
Image(filename="images/data_demo.gif", width=img_size, height=img_size)
Out[3]:
<IPython.core.display.Image object>

Environment Set-up without Google Colab

If you're not running on Google Colab, use the prep_no_colab.sh script to install the required libraries, pull the model checkpoint, and download the data. This script was tested on Ubuntu Linux only. After running the script, your folder should look something like this:

├── images
│   ├── dataset
│   │   ├── train
│   │   ├── test
│   │   ├── val
│   │   ├── transforms_train.json
│   │   ├── transforms_test.json
│   │   ├── transforms_val.json
├── ckpts
│   ├── sam_vit_h_4b8939.pth

Our recommended method for loading iPython notebooks on your local computer is to use a Visual Studio Code plugin; here is a short tutorial on how to do that. Having issues setting up your system? Problems with CUDA versions? Use Colab instead, or ask a TA if you really want to use your own compute.

Set-up

Necessary imports and helper functions for displaying points, boxes, and masks.

In [4]:
%matplotlib inline

import numpy as np
import torch
import matplotlib.pyplot as plt
import cv2
import sys
import os
In [5]:
def show_mask(mask, ax, random_color=False):
    # This function is used to visualize the mask on the image in a matplotlib axis.
    # bool mask: (H, W). True for each pixel that belongs to the object.
    # ax: matplotlib axis
    # random_color: if True, use a random color for the mask. Otherwise, use blue.

    if random_color:
        color = np.concatenate([np.random.random(3), np.array([0.6])], axis=0)
    else:
        color = np.array([30/255, 144/255, 255/255, 0.6])
    h, w = mask.shape[-2:]
    mask_image = mask.reshape(h, w, 1) * color.reshape(1, 1, -1)
    ax.imshow(mask_image)

def show_box(box, ax):
    # This function is used to visualize the bounding box on the image in a matplotlib axis.
    # box: (4,) array. [x0, y0, x1, y1]
    # (x0, y0): top-left corner
    # (x1, y1): bottom-right corner
    # ax: matplotlib axis

    x0, y0 = box[0], box[1]
    w, h = box[2] - box[0], box[3] - box[1]
    ax.add_patch(plt.Rectangle((x0, y0), w, h, edgecolor='green', facecolor=(0,0,0,0), lw=2))

Q1: Loading and Understanding the Dataset [2 pts]

The dataset you'll be working with is a synthetic dataset generated specifically for this homework. We used the free and open-source 3D graphics software tool Blender to render images from 100 different poses. You will have access to the following data:

  • The intrinsic parameters, constant for all images. The dataset was rendered using a pinhole camera model without distortion, so the intrinsics can be captured by a camera matrix K.
  • The extrinsics, as a [100 x 4 x 4] array. Each [4 x 4] matrix gives the cam2world transformation.
  • File paths to the 100 images, each of shape [800 x 800 x 3].
  • 100 depth images, each of shape [800 x 800 x 3]. Each channel holds the depth in meters, so two of the channels are redundant.
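As a quick sanity check of how depth and intrinsics combine (a sketch, not graded code): assuming the stored depth is the z-coordinate in the OpenCV camera frame, a pixel (u, v) with depth z lifts to camera coordinates as z * K^-1 [u, v, 1]^T. The K below uses the values for this dataset's 800 x 800 pinhole camera; `backproject` is an illustrative helper, not a required function.

```python
import numpy as np

# intrinsics of the dataset's 800x800 pinhole camera (same values as in Q1.1)
K = np.array([[1111.111, 0.0, 400.0],
              [0.0, 1111.111, 400.0],
              [0.0, 0.0, 1.0]])

def backproject(u, v, z, K):
    # lift pixel (u, v) with depth z (meters) to the OpenCV camera frame:
    # X_cam = z * K^-1 @ [u, v, 1]
    return z * np.linalg.inv(K) @ np.array([u, v, 1.0])

# the principal point lifts onto the optical axis, i.e. to (0, 0, z)
X = backproject(400.0, 400.0, 2.0, K)
print(np.allclose(X, [0.0, 0.0, 2.0]))  # -> True
```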

We will now load and visualize the dataset.

Q1.1 Compute the camera matrix K [2 pts]

In [6]:
import json

# dataset class provided to load extrinsics, intrinsics and image paths.
class Dataset:
    def __init__(self, json):
        self.json = json # the json file containing the extrinsics, intrinsics and image paths.
        self.load_extrinsics()
        self.load_intrinsics()
        self.compute_intrinsics()

    def load_extrinsics(self):
        # This function loads the extrinsics parameters from the json file.

        with open(self.json) as f:
            self.data = json.load(f)
        self.frames = self.data['frames'] # 'frames' in the json contains the extrinsics and image path for each image.
        self.transforms = np.array([frame['transform_matrix'] for frame in self.frames]) # extrinsic matrix for each image, shape (N, 4, 4)
        self.file_paths = np.array([frame['file_path'] for frame in self.frames]) # path to each image, shape (N,)

    def load_intrinsics(self):
        # This function loads the intrinsics parameters from the json file.
        self.f_x = self.data['fl_x'] # focal length in x
        self.f_y = self.data['fl_y'] # focal length in y
        self.w = self.data['w'] # image width
        self.h = self.data['h'] # image height
        self.cx = self.data['cx'] # principal point in x
        self.cy = self.data['cy'] # principal point in y

    def compute_intrinsics(self):
        # self.K = None # K: the intrinsic matrix, shape (3, 3)
        # compute the K matrix from the intrinsic parameters loaded in load_intrinsics() : [2 pts]
        # TODO: YOUR CODE HERE
        self.K = np.array([[self.f_x, 0, self.cx],
                           [0, self.f_y, self.cy],
                           [0, 0, 1]])

dataset = Dataset('images/dataset/transforms_train.json') # load the dataset
np.set_printoptions(precision=3, suppress=True) # do NOT remove this line when you print matrices for grading

print('Shape of extrinsic matrices: {}'.format(dataset.transforms.shape)) # all extrinsic matrices, shape (N, 4, 4)
#TODO: print the intrinsic matrix and add to your gradescope submission.
print('K matrix {}'.format(dataset.K)) # The intrinsic matrix K you computed.
Shape of extrinsic matrices: (100, 4, 4)
K matrix [[1111.111    0.     400.   ]
 [   0.    1111.111  400.   ]
 [   0.       0.       1.   ]]

Visualize the dataset [0 pts]¶

Here we show the RGB and depth data that are part of the dataset. Reasoning about algorithm design is often easier when you understand the data.

In [7]:
image = cv2.imread(os.path.join('images/dataset',dataset.file_paths[0]))
image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
In [8]:
plt.figure(figsize=(10,10))
plt.imshow(image)
plt.axis('on')
plt.show()
In [9]:
def depthmap_viz(depth,min_d=0.0,max_d=3.5):
    # depth: (H,W,3) - depth map, every channel contains the same depth values for that pixel. 2 channels are redundant.

    # min_d: minimum depth value to visualize
    # max_d: maximum depth value to visualize

    # keep a single channel so the colormap and value range actually apply
    depth = np.clip(depth[..., 0], min_d, max_d)

    plt.clf()
    plt.imshow(depth, cmap='magma', vmin=min_d, vmax=max_d)
In [10]:
depth_location = 'images/dataset/train/depth.npy' # location of ground truth depth maps.
depths = np.load(depth_location) # load the depth maps

depthmap_viz(depths[0]) # visualize the first depth map
plt.show() # show the plot

Loading SAM [0 pts]¶

The Segment Anything Model (SAM) produces high quality object masks from input prompts such as points or boxes, and it can be used to generate masks for all objects in an image. It has been trained on a dataset of 11 million images and 1.1 billion masks, and has strong zero-shot performance on a variety of segmentation tasks.

Here we load the SAM model and predictor. Running on CUDA and using the default model are recommended for best results. Do not change any of these settings: altering them will complicate our grading, and you may not receive full credit if you do.

In [11]:
import sys
from segment_anything import sam_model_registry, SamPredictor

sam_checkpoint = "ckpts/sam_vit_h_4b8939.pth" # the checkpoint loaded in the setup section.
model_type = "vit_h"

device = "cuda" # loading to GPU.

sam = sam_model_registry[model_type](checkpoint=sam_checkpoint)
sam.to(device=device)

predictor = SamPredictor(sam)
/usr/local/lib/python3.10/dist-packages/segment_anything/build_sam.py:105: FutureWarning: You are using `torch.load` with `weights_only=False` (the current default value), which uses the default pickle module implicitly. It is possible to construct malicious pickle data which will execute arbitrary code during unpickling (See https://github.com/pytorch/pytorch/blob/main/SECURITY.md#untrusted-models for more details). In a future release, the default value for `weights_only` will be flipped to `True`. This limits the functions that could be executed during unpickling. Arbitrary objects will no longer be allowed to be loaded via this mode unless they are explicitly allowlisted by the user via `torch.serialization.add_safe_globals`. We recommend you start setting `weights_only=True` for any use case where you don't have full control of the loaded file. Please open an issue on GitHub for any issues related to this experimental feature.
  state_dict = torch.load(f)

Q2: Designing our prompt [2 points]¶

SAM and other segmentation models use input prompts, such as a box around an object, to generate a mask of the object. From now on, when we refer to a prompt, we mean the box around the object to be segmented, which is the input to SAM.

In this homework we will start by acquiring a single user-generated prompt in one image: a box around the coffee mug. We will then use depth and camera geometry to propagate the mask to other frames. It is therefore important for the one user-specified prompt to be high-quality. Set the input_box parameter such that we get a high-quality segmentation.

Now we load in the first example image into the model.

In [12]:
#TODO: YOUR CODE HERE
# add this plot to gradescope submission.
# input_box = None # choose correct bounding box [2 pts]
input_box = np.array([350, 250, 450, 350])

plt.figure(figsize=(10, 10))
plt.imshow(image)
show_box(input_box, plt.gca())
plt.axis('off')
plt.show()
In [13]:
# Here we're running SAM on the image with the bounding box.
predictor.set_image(image) # loading the image to the predictor.
masks, _, _ = predictor.predict(
    point_coords=None,
    point_labels=None,
    box=input_box[None, :],
    multimask_output=False,
)
# Calling the predictor with the bounding box.
# You will not need to change any of the other arguments in this homework.

Visualizing the mask¶

In [14]:
mask = masks[0]
h, w = mask.shape[-2:]
mask_image = mask.reshape(h, w, 1)

plt.figure(figsize=(10, 10))
plt.imshow(image)
show_mask(mask, plt.gca())
show_box(input_box, plt.gca())
plt.axis('off')
plt.show()

Q3: Project to 3D [20 points]¶

As discussed before, the aim of this homework is to use the image mask in one image and propagate it to novel views. In this section we will take the mask generated in the previous question and project the pixels in the mask to 3D coordinates using depth and camera geometry. Remember that 3D points are projected to image coordinates using the projection matrix P = K[R|t]; with known depth, this projection can be inverted.
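As a sanity check on this inversion: with known depth d, a pixel (u, v) back-projects to the camera-frame point d · K⁻¹[u, v, 1]ᵀ. A minimal sketch using the intrinsics computed in Q1 (the depth value is an arbitrary example):

```python
import numpy as np

# Intrinsics from Q1 (fx = fy = 1111.111, cx = cy = 400).
K = np.array([[1111.111, 0.0, 400.0],
              [0.0, 1111.111, 400.0],
              [0.0, 0.0, 1.0]])

# Back-project the principal point (400, 400) at depth 2.0 m.
# It lies on the optical axis, so it should map to (0, 0, 2) in the camera frame.
uv1 = np.array([400.0, 400.0, 1.0])
pt = 2.0 * (np.linalg.inv(K) @ uv1)
print(np.round(pt, 6))  # -> [0. 0. 2.]
```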

Q3.1: Image frame to camera frame [10 pts]¶

In this part of the question you will project points from the image frame (pixel coordinates) to a point cloud in the camera frame. For this you will only need the intrinsic matrix K and the depth. You will not need dataset.transforms in Q3.1.

In [15]:
def img2cam(points, K, depths=None):
    # project the points from image coordinates to camera coordinates [5 pts]
    cam_3d = None

    # steps todo:
    # (1) Use the intrinsic matrix K to convert the points from image coordinates to a point cloud in the camera frame.
    # (2) Normalize the points to a plane with z=1.
    # (3) Use depths to scale the points to be at the correct distance from the camera.

    # TODO: YOUR CODE HERE
    # (1) Use the intrinsic matrix K to convert the points from image coordinates to a point cloud in the camera frame.
    cam_3d = np.concatenate([points, np.ones((points.shape[0],1))], axis=1)
    cam_3d = cam_3d.T
    cam_3d = np.linalg.inv(K) @ cam_3d

    # (2) Normalize the points to a plane with z=1.
    cam_3d_norm = cam_3d / cam_3d[2, :] # normalize to z=1

    # (3) Use depths to scale the points to be at the correct distance from the camera.
    if depths is not None:
        depths = depths.reshape(1, -1)
        cam_3d_norm = cam_3d_norm * depths

    cam_3d = cam_3d_norm

    return cam_3d


def filter_points(coords, depths, thresh=2.55):
    # filter out points that are too far away in the first mask; this first mask will be very important! [2 pts]
    # return filtered coords and depths
    # don't make it too complicated, this should be a one-liner.

    # TODO: YOUR CODE HERE
    keep = depths.squeeze() < thresh
    return coords[keep], depths.squeeze()[keep]


def mask2cam(mask, K, depths, thresh=None):
    # project mask points to camera frame [3 pts]
    # steps todo:
    # (1) get all coordinates where the mask is True, this should be N x 2
    # (2) get the depth values for these coordinates, Nx1
    # (3) call filter_points to filter out points that are too far away, with depth above the threshold.
    # only call filter_points if thresh is not None
    # Here far away means the depth is above a certain threshold.
    # (4) call img2cam to convert the points to camera frame using intrinsics and depth.

    # # TODO: YOUR CODE HERE
    # (1) get all coordinates where the mask is True, this should be N x 2
    coords = np.argwhere(mask)[:, :2]

    # (2) get the depth values for these coordinates, Nx1
    depth_values = depths[coords[:, 0], coords[:, 1], 1].reshape(-1, 1)

    # # Print verifications
    # print("Coordinates shape:", coords.shape)
    # print("Depth values shape:", depth_values.shape)

    # (3) call filter_points to filter out points that are too far away, with depth above the threshold.
    if thresh is not None:
        coords, depth_values = filter_points(coords, depth_values, thresh)

    # (4) call img2cam to convert the points to camera frame using intrinsics and depth.
    cam_3d = img2cam(coords, K, depth_values)

    # np.argwhere returns (row, col) pairs, so swap the first two rows to order the output as (x, y, z).
    return np.vstack((cam_3d[1], cam_3d[0], cam_3d[2]))
In [16]:
cam_pnts_3d = mask2cam(mask_image,dataset.K,depths[0],thresh=2.55)
In [17]:
def viz_pts_3d(pts,xrange=None,yrange=None,zrange=None,title=None):
    # viz the 3D points
    fig = plt.figure(figsize=(10,10))
    ax = fig.add_subplot(111, projection='3d')
    ax.scatter(pts[0,:],pts[1,:],pts[2,:],s=1)
    ax.set_xlabel('X [m]')
    ax.set_ylabel('Y [m]')
    ax.set_zlabel('Z [m]')

    if xrange is not None:
        ax.set_xlim(xrange)
    if yrange is not None:
        ax.set_ylim(yrange)
    if zrange is not None:
        ax.set_zlim(zrange)

    if title is not None:
        ax.set_title(title)
    plt.show()

np.save('cam_pnts_3d.npy',cam_pnts_3d)

#TODO: add this plot to gradescope submission
viz_pts_3d(cam_pnts_3d)

Q3.2: Camera frame to world frame [10 points]¶

We now project the points in the camera frame to the world frame. Keep in mind that the transforms provided are between the world frame and the Blender camera frame (pictured below); you will need to account for this difference when using the equations of projective geometry. For example, the intrinsics matrix K expects 3D points in the OpenCV camera frame. Summarizing, these are the existing frames:

  • Camera frame/OpenCV Camera Frame: This is the reference frame for 3D points with respect to the camera, as seen in class up to now.
  • Blender Camera Frame: This is the camera frame for 3D points used in Blender; the y and z axes point in opposite directions w.r.t. the OpenCV camera frame.
  • World Frame: This is the frame for 3D points with respect to the world origin.
  • Image Frame: The 2D points in the image, e.g., u, v in range [0,W] and [0,H].
In [18]:
Image(filename="images/cam_frames.png", width=img_size)
Out[18]:
In [19]:
def cam2world(points, transform):
    # project camera coordinates to world coordinates [5 pts]
    # NOTE: transform is the transformation from the blender camera frame to the world frame.
    # TODO: YOUR CODE HERE

    # Flip Y and Z axes
    points = points * np.array([1, -1, -1]).reshape(3, 1)

    # Homogeneous coordinates
    homo_points = np.concatenate([points, np.ones((1, points.shape[1]))], axis=0)

    # Apply transformation
    world_points = np.dot(transform, homo_points)

    return world_points

def world2cam(points, transform):
    # project world coordinates to camera coordinates [5 pts]
    # NOTE: do not use np.linalg.inv to compute the inverse of transform, we will award only partial credit.
    # There is an intuitive and elegant way to compute the inverse of transform.
    # NOTE: do not forget about blender coordinates!
    # TODO: YOUR CODE HERE

    # Rotation and Translation matrix
    R = transform[:3, :3]
    T = transform[:3, 3]

    # Inverse rotation and translation
    R_inv = R.T
    T_inv = -np.dot(R_inv, T)

    # Inverse transformation matrix
    inverse_transform = np.hstack((np.vstack((R_inv, [0, 0, 0])), np.hstack((T_inv,[1])).reshape(-1,1)))

    # Apply inverse transformation
    cam_points = np.dot(inverse_transform, points)

    # Normalize the points
    cam_points /= cam_points[3]

    # Flip Y and Z to match OpenCV coordinates
    cam_points = np.vstack((cam_points[0], -cam_points[1], -cam_points[2]))

    return cam_points
In [20]:
def show_mask(mask, ax, random_color=False):
    # This function is used to visualize the mask on the image in a matplotlib axis.
    # bool mask: (H, W). True for each pixel that belongs to the object.
    # ax: matplotlib axis
    # random_color: if True, use a random color for the mask. Otherwise, use blue.

    if random_color:
        color = np.concatenate([np.random.random(3), np.array([0.6])], axis=0)
    else:
        color = np.array([30/255, 144/255, 255/255, 0.6])
    h, w = mask.shape[-2:]
    mask_image = mask.reshape(h, w, 1) * color.reshape(1, 1, -1)
    ax.imshow(mask_image)

def show_box(box, ax):
    # This function is used to visualize the bounding box on the image in a matplotlib axis.
    # box: (4,) array. [x0, y0, x1, y1]
    # ax: matplotlib axis

    x0, y0 = box[0], box[1]
    w, h = box[2] - box[0], box[3] - box[1]
    ax.add_patch(plt.Rectangle((x0, y0), w, h, edgecolor='green', facecolor=(0,0,0,0), lw=2))
In [22]:
transform_0 = dataset.transforms[0]

world_pts = cam2world(cam_pnts_3d,transform_0)

np.save('world_pts.npy',world_pts)
#TODO: add this plot to gradescope submission
viz_pts_3d(world_pts)

Q4: Casting masks to new frames [10 pts]¶

Now we will look at new viewpoints and extract their point clouds. To do this we will loop through the new viewpoints one by one and run SAM on each new image. In each iteration of the loop, we add the new 3D points to the previous 3D points, which are then used in the next iteration to create a new mask using projective geometry and depth.

A simple approach might be to deploy the projective geometry we have developed, and find the bounding box at each iteration based on the coordinate range of the projected point cloud. That is, we project all 3D points to the novel view, an array of [N,2] with each row an x and y coordinate in the image frame. The bounding box could be simply [min_x,min_y,max_x,max_y]. What would be the problem with an approach like this?

Occlusions and noise! Noise can come from pixels that weren't segmented correctly, while occlusions can cause entirely erroneous masks. You will implement several functions to improve the results:
(1) Filter out masks whose confidence is too low.
(2) Filter out points that are too far away from anything else we've seen so far.
(3) Carefully choose the bounding box dimensions so that noise does not have a big effect on the result.
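One simple way to make the box robust to stray projected points is to ignore the extreme tails of the coordinate distribution. The helper name and percentile cutoffs below are illustrative only (the assignment code trims a fixed number of sorted points instead):

```python
import numpy as np

def robust_box(pts_2d, lo=2.0, hi=98.0):
    # pts_2d: (2, N) projected pixel coordinates.
    # Percentile cutoffs drop the extreme tails, so a handful of
    # mis-projected outlier pixels cannot blow up the box.
    x0, x1 = np.percentile(pts_2d[0], [lo, hi])
    y0, y1 = np.percentile(pts_2d[1], [lo, hi])
    return np.array([x0, y0, x1, y1])

# 200 inlier points around (400, 300) plus two wild outliers at image corners.
rng = np.random.default_rng(0)
pts = rng.normal([[400.0], [300.0]], 10.0, size=(2, 200))
pts = np.hstack([pts, [[0.0, 799.0], [0.0, 599.0]]])
box = robust_box(pts)
# The corner outliers barely move the box away from the inlier cluster.
```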

In [23]:
def cam2img(points, K):
    # project camera coordinates to image coordinates [3 pts]
    # output should be pixel coordinates in the correct range with shape (2, N)

    # TODO: YOUR CODE HERE

    # Apply camera intrinsic
    img_points = np.dot(K, points)

    # Normalize to pixel coordinates
    img_points = img_points[:2, :] / img_points[2, :]

    return img_points
In [24]:
# (1) Filter out masks that have a confidence that's too low. [0 pts]

score_thresh = 0.85
def keep_score(score):
    return score > score_thresh
In [30]:
# (2) Filter out points that are too far away from anything else we've seen so far.

dist_thresh = 0.1 # decide on a good distance threshold [2 pts]

def keep_dist(new_pts, existing_pts):
    # TODO: YOUR CODE HERE
    # compute median of all_world_pts
    # compute distance between tmp_world_pts and median of all_world_pts
    # reject outliers that are too far away from all other points

    # compute median of all_world_pts
    median_all_world_pts = np.median(existing_pts, axis=1, keepdims=True)

    # compute distance between tmp_world_pts and the median of all_world_pts
    distance = np.linalg.norm(new_pts - median_all_world_pts, axis=0)

    # reject outliers that are too far away from all other points
    keep_mask = distance < dist_thresh

    return new_pts[:, keep_mask]
In [31]:
# (3) Carefully choose the bounding box dimensions so that noise does not have a big effect on the result.

# filter out points n_std away from mean of all points [3 pts]
def filter_for_box(world_points,transform, n_std=2):
    # TODO: YOUR CODE HERE
    # compute mean and std of all points [2 pts]
    # filter out points that are n_std away from mean of all points [3 pts]
    # transform to cam frame [1 pt]
    # you will need the intrinsics matrix K here, simply call dataset.K

    # compute mean of the xyz coordinates (world_points is homogeneous, 4 x N)
    xyz = world_points[:3]
    mean = np.mean(xyz, axis=1, keepdims=True)

    # Translate points by subtracting the mean (zero-center)
    translated_points = xyz - mean

    # compute per-axis std of all points
    std = np.std(translated_points, axis=1, keepdims=True)

    # filter out points that are more than n_std standard deviations from the mean on any axis
    keep_mask = np.all(np.abs(translated_points) <= n_std * std, axis=0)

    # apply mask (keep all four rows, including the homogeneous one)
    filtered_world_points = world_points[:, keep_mask]

    # # Debug: Check the shape after filtering
    # print(f"Shape of filtered_world_points: {filtered_world_points.shape}")

    # transform to cam frame
    cam_points = world2cam(filtered_world_points, transform)

    # you will need the intrinsics matrix K here, simply call dataset.K
    img_points = cam2img(cam_points, dataset.K)

    return img_points

# based on the filtered points, compute the bounding box [2 pts]
def prompt_points_to_box(prompt):
    # TODO: YOUR CODE HERE
    # output: np.array([x0, y0, x1, y1]])
    # (x0, y0): top-left corner
    # (x1, y1): bottom-right corner

    # Check if prompt contains points
    if len(prompt[0]) == 0 or len(prompt[1]) == 0:
        print("Error: No points available to compute bounding box.")
        return None

    # Sort so the extremes can be trimmed
    x, y = np.sort(prompt[0]), np.sort(prompt[1])

    # Remove outliers by trimming the 10 smallest and 10 largest values
    x, y = x[10:-10], y[10:-10]

    # Check that points remain after trimming
    if len(x) == 0 or len(y) == 0:
        print("Error: After trimming, no points available.")
        return None

    # compute bounding box
    return np.array([np.min(x), np.min(y), np.max(x), np.max(y)])
In [32]:
import os
import copy
all_world_pts = copy.deepcopy(world_pts)

it = 1

# show_its = [1,4,10]
show_its = [1,4,10,50,75,99]

end_idx = -1 # set this to a different number, e.g. 10, for faster debugging

#TODO: add all plots generated to gradescope submission
for transform, file_path, depth in zip(dataset.transforms[1:end_idx], dataset.file_paths[1:end_idx], depths[1:end_idx]):
    # compute 3d points
    image = cv2.imread(os.path.join('images/dataset/', file_path))
    image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)

    prompt_points = filter_for_box(all_world_pts, transform) # (3) Carefully choose the bounding box dimensions to avoid noise to have a big effect on the result.

    if it in show_its:
        plt.imshow(image)
        # scatter plot the img_pts
        plt.scatter(prompt_points[0,:],prompt_points[1,:],s=5)
        plt.title('It {}, prompt points.'.format(it))
        plt.show()

    predictor.set_image(image)

    prompt_box = prompt_points_to_box(prompt_points) # (3) Choose the bounding box points based on the filtered points.

    masks, scores, _ = predictor.predict(
        box=prompt_box,
        point_labels=[1],
        multimask_output=False,
    )

    if it in show_its:
        plt.figure(figsize=(10,10))
        plt.imshow(image)
        show_mask(masks, plt.gca())
        show_box(prompt_box, plt.gca())
        plt.axis('off')
        plt.title('It {}, mask.'.format(it))
        plt.show()

    mask = masks.reshape((h, w, 1))
    cam_pnts_3d = mask2cam(mask,dataset.K,depth)
    tmp_world_pts = cam2world(cam_pnts_3d,transform)

    if keep_score(scores): # (1) Filter out masks that have a score that's too low.
        tmp_world_pts = keep_dist(tmp_world_pts,all_world_pts) # (2) Filter out points that are too far away from anything else we've seen so far.
        all_world_pts = np.hstack([all_world_pts,tmp_world_pts])

    if it in show_its:
        viz_pts_3d(all_world_pts,title='It {}, all points found so far.'.format(it),xrange=[-0.1,0.1],yrange=[-0.45,-0.25],zrange=[0.06,0.15])

    it+=1

Visualize all 3D points¶

In [33]:
#TODO: do NOT need to add to gradescope submission
viz_pts_3d(all_world_pts,xrange=[-0.1,0.1],yrange=[-0.45,-0.25],zrange=[0.06,0.15])

Q5: Statistical outlier removal [3 pts]¶

In [34]:
def filter_points(points,nb_neighbors=20,std_ratio=2.0):
    import open3d as o3d
    # filter points using open3d statistical outlier removal. [3 pts]
    # TODO: YOUR CODE HERE

    # Create point cloud data
    pcd = o3d.geometry.PointCloud()
    pcd.points = o3d.utility.Vector3dVector(points.T)

    # Run statistical outlier removal
    pcd_filtered, ind = pcd.remove_statistical_outlier(nb_neighbors, std_ratio)

    return np.asarray(pcd_filtered.points).T


all_world_pts_filtered = filter_points(all_world_pts[:3,:])

#TODO: add this plot to gradescope submission
viz_pts_3d(all_world_pts_filtered,xrange=[-0.1,0.1],yrange=[-0.45,-0.25],zrange=[0.06,0.15])

Expected Output¶

Below is the output we achieved at the end of the homework; your implementation should produce similar results to receive full credit.

In [ ]:
Image(filename="images/expected_output.png", width=img_size, height=img_size)
Out[ ]:

Q6 Extra credit: 3D Segmentation without Ground Truth Depth [10 pts max]¶

So far we have provided you with ground truth depth from the 3D rendering toolbox. For a maximum of 10 extra points, can you achieve similar accuracy without using ground truth depth? You can use any toolbox/repository you like, as long as they infer dense depth maps: depth for every pixel in the image. Here is one approach we believe would be relatively straightforward:

  • The dataset is formatted for Neural Radiance Fields (NeRFs). You should be able to run NeRF on this dataset with minimal modifications.
  • Suggested NeRF pipeline: Torch-NGP. It runs fast using a cuda backend, but all the high-level features are implemented in Torch.
  • Some necessary changes: (1) Modify the code to only use the training data, no testing or validation (not provided in dataset). You could also copy training data to a test and validation folder and create the necessary json files. (2) Modify the code to return all train depth maps in a .npy array, in units [meters].

Any other methods are allowed and encouraged, as long as they infer dense depth maps: depth for every pixel in the image. Keep in mind the difference between z-depth as in Blender (the distance along the camera's principal axis) and depth as Euclidean distance from the camera center.
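If the method you choose predicts Euclidean ray distance rather than z-depth, the two are related by the norm of the back-projected ray: a pixel at distance d along ray r = K⁻¹[u, v, 1]ᵀ sits at (d/‖r‖)·r, whose z component is d/‖r‖. A sketch of the conversion (the function name and map layout are assumptions, not part of any particular depth toolbox):

```python
import numpy as np

def ray_dist_to_z_depth(dist_map, K):
    # dist_map: (H, W) Euclidean distance from the camera center per pixel.
    # Returns (H, W) z-depth along the camera's principal axis.
    H, W = dist_map.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    # rays[i, j] = K^{-1} [u, v, 1]^T for pixel (u=j, v=i)
    rays = np.stack([u, v, np.ones_like(u)], axis=-1) @ np.linalg.inv(K).T
    # The 3D point is (dist / ||r||) * r, so its z component is dist / ||r||.
    return dist_map / np.linalg.norm(rays, axis=-1)

# Intrinsics from Q1; a constant 2 m distance map as a toy input.
K = np.array([[1111.111, 0.0, 400.0],
              [0.0, 1111.111, 400.0],
              [0.0, 0.0, 1.0]])
z = ray_dist_to_z_depth(np.full((800, 800), 2.0), K)
# At the principal point the ray is the optical axis, so z-depth equals distance;
# toward the corners the rays are longer, so z-depth is smaller.
```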